Citi Bike Usage in NYC

STA 9750 Individual Final Report

Author

Sabrina Zhu

Published

December 13, 2025

Overview

Specific Question: How does station infrastructure relate to e-bike vs classic bike usage?

Key Finding: High-infrastructure areas show lower e-bike usage, the opposite of what one might expect. This suggests that e-bike availability is constrained by demand, not by infrastructure quality.

Load Data

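# Packages used throughout (assumed to be loaded in a hidden setup chunk)
library(data.table)
library(sf)
library(dplyr)
library(ggplot2)

citibike_sample <- readRDS("data/processed/citibike_manhattan_sample_5pct.rds")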
routes <- readRDS("data/individual_report/bike_routes.rds")
manhattan_poly <- readRDS("data/gis/manhattan_polygon.rds")

cat("Loaded", format(nrow(citibike_sample), big.mark = ","), "trips\n")
Loaded 1,618,078 trips
cat("Date range:", as.character(min(citibike_sample$date)), 
    "to", as.character(max(citibike_sample$date)), "\n")
Date range: 2024-09-29 to 2025-10-31 

Rider Type Summary

rider_summary <- citibike_sample[, .(
  total_trips = .N,
  electric_bikes = sum(rideable_type == "electric_bike"),
  classic_bikes = sum(rideable_type == "classic_bike")
), by = member_casual]

rider_summary[, `:=`(
  pct_electric = round(electric_bikes / total_trips * 100, 1),
  pct_classic = round(classic_bikes / total_trips * 100, 1)
)]

knitr::kable(rider_summary, 
             col.names = c("Rider Type", "Total Trips", "Electric Bikes", 
                          "Classic Bikes", "% Electric", "% Classic"),
             format.args = list(big.mark = ","))
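|Rider Type | Total Trips| Electric Bikes| Classic Bikes| % Electric| % Classic|
|:----------|-----------:|--------------:|-------------:|----------:|---------:|
|member     |   1,340,439|        896,332|       444,107|       66.9|      33.1|
|casual     |     277,639|        203,233|        74,406|       73.2|      26.8|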

Key Stats:

  • Overall e-bike usage: 68%
  • Casual riders consistently prefer e-bikes more than members (~73% vs ~67%)
  • Members make up about 83% of all trips in this sample (1,340,439 of 1,618,078)
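
The headline figures can be cross-checked directly from the rider summary table. The sketch below redoes that arithmetic in base R with the counts copied from the table. The object name `summary_counts` is introduced here for illustration and is not part of the original analysis.

```r
# Counts copied from the rider summary table (illustrative object name)
summary_counts <- data.frame(
  rider    = c("member", "casual"),
  electric = c(896332, 203233),
  classic  = c(444107, 74406)
)
summary_counts$total <- summary_counts$electric + summary_counts$classic

# Share of all trips taken on e-bikes, in percent
overall_ebike_pct <- 100 * sum(summary_counts$electric) / sum(summary_counts$total)

# Share of all trips taken by members, in percent
member_trip_pct <- 100 * summary_counts$total[summary_counts$rider == "member"] /
  sum(summary_counts$total)

print(round(c(overall_ebike = overall_ebike_pct, member_trips = member_trip_pct), 1))
```

Recomputing shares from the printed counts is a cheap way to catch a summary statistic that has drifted from the table it describes.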

Infrastructure Classification

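I classified stations by infrastructure quality using two metrics. Each is scaled to 0-1, the two are averaged with equal weight, and stations are then split into Low, Medium, and High tertiles of the combined score:

  1. Station Density: Number of other stations within 500 m (min-max scaled)
  2. Bike Lane Proximity: Distance in meters to the nearest bike lane, capped at the 95th percentile and inverted so that closer stations score higher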
# Get unique stations with coordinates
stations <- citibike_sample[!is.na(start_station_id) & 
                             !is.na(start_lat) & 
                             !is.na(start_lng), 
                           .(lat = first(start_lat),
                             lng = first(start_lng),
                             station_name = first(start_station_name)),
                           by = start_station_id]

cat("Found", nrow(stations), "unique stations\n")
Found 734 unique stations
# Convert to spatial object
stations_sf <- st_as_sf(stations, coords = c("lng", "lat"), crs = 4326, remove = FALSE)

# Metric 1: Station Density (count nearby stations within 500m)
stations_buffer <- st_buffer(stations_sf, dist = 500)
stations$nearby_count <- sapply(1:nrow(stations_sf), function(i) {
  sum(st_intersects(stations_buffer[i, ], stations_sf, sparse = FALSE)) - 1
})

# Metric 2: Distance to Bike Lanes
manhattan_bbox <- st_bbox(c(xmin = -74.02, xmax = -73.90, ymin = 40.70, ymax = 40.88), 
                          crs = st_crs(4326)) |> st_as_sfc()

manhattan_routes_sf <- routes %>%
  st_transform(4326) %>%  # match the bbox CRS before the spatial filter
  filter(st_intersects(geometry, manhattan_bbox, sparse = FALSE)[, 1])

stations$dist_to_bike_lane_m <- sapply(1:nrow(stations_sf), function(i) {
  distances <- st_distance(stations_sf[i, ], manhattan_routes_sf)
  min(as.numeric(distances))
})

# Create Infrastructure Score (normalized 0-1)
stations$density_score <- (stations$nearby_count - min(stations$nearby_count)) / 
  (max(stations$nearby_count) - min(stations$nearby_count))

max_dist <- quantile(stations$dist_to_bike_lane_m, 0.95)
stations$lane_proximity_score <- 1 - pmin(stations$dist_to_bike_lane_m, max_dist) / max_dist

stations$infrastructure_score <- 0.5 * stations$density_score + 0.5 * stations$lane_proximity_score

# Classify into tertiles
tertiles <- quantile(stations$infrastructure_score, probs = c(0.33, 0.67))
stations$infrastructure_level <- cut(
  stations$infrastructure_score,
  breaks = c(-Inf, tertiles[1], tertiles[2], Inf),
  labels = c("Low", "Medium", "High"),
  include.lowest = TRUE
)

cat("\nInfrastructure Classification:\n")

Infrastructure Classification:
print(table(stations$infrastructure_level))

   Low Medium   High 
   242    250    242 
# Save station classification
saveRDS(stations, "data/processed/stations_infrastructure.rds")
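
To make the scoring concrete, here is a minimal worked example of the same two normalizations on three made-up stations. The values in `toy` are hypothetical and chosen only so the arithmetic is easy to follow. They are not drawn from the Citi Bike data.

```r
# Hypothetical toy stations (not real data)
toy <- data.frame(
  nearby_count        = c(2, 10, 18),   # other stations within 500 m
  dist_to_bike_lane_m = c(400, 100, 0)  # meters to the nearest bike lane
)

# Min-max scale station density to 0-1
count_range <- range(toy$nearby_count)
toy$density_score <- (toy$nearby_count - count_range[1]) /
  (count_range[2] - count_range[1])

# Cap distance at the 95th percentile, then invert so closer = higher
max_dist <- unname(quantile(toy$dist_to_bike_lane_m, 0.95))
toy$lane_proximity_score <- 1 - pmin(toy$dist_to_bike_lane_m, max_dist) / max_dist

# Equal-weight average of the two components
toy$infrastructure_score <- 0.5 * toy$density_score + 0.5 * toy$lane_proximity_score

print(round(toy, 3))
```

Because both components are bounded in 0-1, the combined score is too. The sparsest, farthest station scores 0 and the densest station sitting on a bike lane scores 1.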

Visualization 1: Infrastructure Map

ggplot() +
  geom_sf(data = manhattan_poly, fill = "gray95", color = "gray60", linewidth = 0.3) +
  geom_sf(data = manhattan_routes_sf, color = "lightblue", alpha = 0.3, linewidth = 0.5) +
  geom_point(data = stations, 
             aes(x = lng, y = lat, color = infrastructure_level, size = nearby_count),
             alpha = 0.7) +
  scale_color_manual(
    values = c("Low" = "#F44336", "Medium" = "#FF9800", "High" = "#4CAF50"),
    name = "Infrastructure Level"
  ) +
  scale_size_continuous(name = "Station Density", range = c(1, 4)) +
  labs(
    title = "Citi Bike Station Infrastructure",
    subtitle = "Stations classified by bike lane proximity and station density",
    caption = "Size indicates number of nearby stations within 500m"
  ) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank(),
    legend.position = "right",
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "gray40")
  )
Figure 1: Station Infrastructure Classification

E-bike Usage Analysis

# Calculate e-bike % by station
station_variation <- citibike_sample[, .(
  ebike_pct = mean(rideable_type == "electric_bike") * 100,
  trips = .N,
  lat = mean(start_lat),
  lng = mean(start_lng)
), by = start_station_name]

# Merge with infrastructure
station_full <- merge(station_variation, 
                      stations[, .(station_name, infrastructure_level, infrastructure_score)],
                      by.x = "start_station_name", 
                      by.y = "station_name",
                      all.x = TRUE)

# Calculate bike balance (relative to average)
overall_ebike_share <- mean(citibike_sample$rideable_type == "electric_bike") * 100
station_full[, bike_balance := ebike_pct - overall_ebike_share]

cat("Overall e-bike usage:", round(overall_ebike_share, 1), "%\n")
Overall e-bike usage: 68 %

Visualization 2: Infrastructure vs E-bike Usage

ggplot(station_full[!is.na(infrastructure_score) & trips >= 50 & ebike_pct >= 40], 
       aes(x = infrastructure_score, y = ebike_pct)) +
  geom_point(aes(color = infrastructure_level, size = trips), alpha = 0.6) +
  geom_smooth(method = "lm", color = "black", linetype = "dashed", se = TRUE) +
  scale_color_manual(values = c("Low" = "#F44336", "Medium" = "#FF9800", "High" = "#4CAF50"),
                     name = "Infrastructure Level") +
  scale_size_continuous(range = c(1, 6), name = "Total Trips") +
  labs(
    title = "Infrastructure Score vs. E-bike Usage",
    subtitle = "Each dot is a station | Dashed line shows overall trend",
    x = "Infrastructure Score (higher = better infrastructure)",
    y = "E-bike Usage (%)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold"))
Figure 2: Infrastructure Score vs E-bike Usage

Key Finding: The trend line slopes downward. Among the plotted stations (at least 50 trips and at least 40% e-bike share), e-bike usage decreases as the infrastructure score increases.
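
The direction of the trend can also be read off a fitted model rather than the plot alone. The sketch below fits the same kind of straight line that `geom_smooth(method = "lm")` draws and reports its slope. The helper `fit_trend()` and the `toy` data are hypothetical and use made-up numbers. In the report the input would be a table shaped like `station_full`.

```r
# Hypothetical helper: linear trend of e-bike share on infrastructure score
fit_trend <- function(df, min_trips = 50) {
  # Keep scored stations with enough trips for a stable percentage
  df <- df[!is.na(df$infrastructure_score) & df$trips >= min_trips, ]
  fit <- lm(ebike_pct ~ infrastructure_score, data = df)
  c(
    slope   = unname(coef(fit)[2]),               # pct points per unit of score
    p_value = summary(fit)$coefficients[2, 4],    # p-value for the slope
    n       = nrow(df)                            # stations used in the fit
  )
}

# Made-up stations for illustration only
toy <- data.frame(
  infrastructure_score = c(0.1, 0.3, 0.5, 0.7, 0.9),
  ebike_pct            = c(75, 73, 70, 69, 66),
  trips                = c(120, 300, 80, 500, 60)
)

trend <- fit_trend(toy)
print(trend)
```

A negative slope corresponds to the downward-sloping dashed line. Note that this sketch does not apply the plot's additional 40% e-bike share cutoff, so on real data the two fits could differ.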

Visualization 3: Bike Type Balance Map

ggplot() +
  geom_sf(data = manhattan_poly, fill = "gray95", color = "gray60", linewidth = 0.3) +
  geom_point(data = station_full[trips >= 50],
             aes(x = lng, y = lat, color = bike_balance, size = trips),
             alpha = 0.9) +
  scale_color_gradientn(
    colors = c("#08306b", "#2171b5", "#6baed6", "#f7f7f7", "#fcbba1", "#fb6a4a", "#cb181d"),
    values = scales::rescale(c(-60, -30, -10, 0, 10, 25, 40)),
    limits = c(-60, 40),
    breaks = c(-30, 0, 20),
    labels = c("More classic", "Near avg", "More e-bikes"),
    name = "Relative E-bike Share"
  ) +
  scale_size_continuous(range = c(2, 7), name = "Station\nTrip Volume", labels = scales::comma) +
  labs(
    title = "Where Are Stations More or Less E-bike-Heavy?",
    subtitle = sprintf("Color shows difference from Manhattan average (~%.0f%% e-bikes): red = higher, blue = lower", overall_ebike_share),
    caption = "Data: Citi Bike Manhattan trips | Stations with 50+ trips shown"
  ) +
  coord_sf(datum = NA) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank(),
    legend.position = "right",
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "gray40")
  )
Figure 3: Bike Type Balance Across Manhattan

Geographic Pattern:

  • Lower Manhattan (high infrastructure): blue, meaning more classic bike usage than average
  • Upper Manhattan (low infrastructure): red, meaning more e-bike usage than average
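
One way to put numbers on this north-south pattern is to average `bike_balance` within latitude bands, weighting by trips. The sketch below does that on made-up stations. The band cut points at 40.76 and 40.80 are assumptions chosen for illustration (only the outer limits 40.70 and 40.88 come from the bounding box used earlier), and the values in `toy` are hypothetical.

```r
# Made-up stations for illustration only
toy <- data.frame(
  lat          = c(40.71, 40.73, 40.78, 40.82, 40.85),
  bike_balance = c(-8, -3, 1, 6, 9),       # pct points vs. overall e-bike share
  trips        = c(900, 700, 400, 150, 100)
)

# Assumed latitude bands; the inner cut points are illustrative, not official
toy$band <- cut(
  toy$lat,
  breaks = c(40.70, 40.76, 40.80, 40.88),
  labels = c("Lower", "Mid", "Upper")
)

# Trip-weighted mean balance per band
band_balance <- sapply(
  split(toy, toy$band),
  function(d) weighted.mean(d$bike_balance, d$trips)
)

print(round(band_balance, 2))
```

Negative values indicate a band that leans toward classic bikes relative to the overall share, positive values one that leans toward e-bikes.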

Multi-Dimensional Analysis

# Define time periods
citibike_sample[, time_period := fcase(
  hour %in% c(6,7,8,9), "Morning Rush",
  hour %in% c(10,11,12,13,14,15), "Midday",
  hour %in% c(16,17,18,19), "Evening Rush",
  hour %in% c(20,21,22,23,0,1,2,3,4,5), "Night"
)]
citibike_sample[, time_period := factor(time_period, 
  levels = c("Morning Rush", "Midday", "Evening Rush", "Night"))]

# Define volume groups
citibike_sample[, station_volume := .N, by = start_station_name]
citibike_sample[, volume_group := cut(station_volume, 
  breaks = quantile(station_volume, probs = c(0, 0.33, 0.67, 1)),
  labels = c("Low Traffic", "Medium Traffic", "High Traffic"),
  include.lowest = TRUE)]

# Calculate e-bike % by rider type, volume, and time
ebike_full <- citibike_sample[!is.na(volume_group) & !is.na(time_period), .(
  ebike_pct = mean(rideable_type == "electric_bike") * 100,
  trips = .N
), by = .(member_casual, volume_group, time_period)]

ebike_full[, rider_label := ifelse(member_casual == "casual", "Casual Riders", "Members")]
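
The time-period mapping above can be sanity-checked on its own. The base R sketch below reproduces the same hour ranges in a hypothetical helper, `classify_hour()`, and confirms that every hour of the day lands in exactly one period. It is an illustration of the binning, not the code used in the report.

```r
# Hypothetical helper mirroring the hour ranges used above
classify_hour <- function(hour) {
  period <- rep(NA_character_, length(hour))
  period[hour %in% 6:9]            <- "Morning Rush"
  period[hour %in% 10:15]          <- "Midday"
  period[hour %in% 16:19]          <- "Evening Rush"
  period[hour %in% c(20:23, 0:5)]  <- "Night"
  factor(period, levels = c("Morning Rush", "Midday", "Evening Rush", "Night"))
}

# Every hour 0-23 should be assigned; count hours per period
hours_per_period <- table(classify_hour(0:23))
print(hours_per_period)
```

The periods are not equally long (the two rush periods span 4 hours each, Midday 6, and Night 10), which is worth keeping in mind when comparing trip counts across them.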

Visualization 4: E-bike Usage Heatmap

ggplot(ebike_full, aes(x = time_period, y = volume_group, fill = ebike_pct)) +
  geom_tile(color = "white", linewidth = 1.5) +
  geom_text(aes(label = paste0(round(ebike_pct, 1), "%")),
            size = 4.5, fontface = "bold", color = "white") +
  facet_wrap(~rider_label) +
  scale_fill_gradient2(
    low = "#3498db",
    mid = "#95a5a6",
    high = "#e74c3c",
    midpoint = 68,
    name = "E-bike\nUsage",
    limits = c(60, 80),
    breaks = seq(60, 80, 5),
    labels = function(x) paste0(x, "%")
  ) +
  labs(
    title = "E-bike Usage Patterns Across Multiple Dimensions",
    subtitle = "Blue = More classic bikes | Red = More e-bikes | Overall average = 68% e-bikes",
    x = "Time of Day",
    y = "Station Activity Level",
    caption = "Rider Type Distribution: Casual – 17% | Member – 83%"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 11),
    axis.text.y = element_text(size = 11),
    strip.text = element_text(size = 14, face = "bold"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0),
    plot.subtitle = element_text(size = 11, color = "gray30"),
    plot.caption = element_text(hjust = 0, size = 9),
    panel.grid = element_blank(),
    legend.position = "right"
  )
Figure 4: E-bike Usage Patterns Across Multiple Dimensions

Key Patterns:

  • Highest e-bike usage (78.8%): Casual riders at low-traffic stations during evening rush
  • Lowest e-bike usage (63.5%): Members at high-traffic stations during morning rush
  • Gap: about 15 percentage points between the highest and lowest cells
  • E-bike usage is higher at low-traffic stations and during the evening rush. One possible reading is that riders prefer an easier ride home after a long day, and that e-bikes are more likely to still be available at quieter stations.
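
The highest and lowest cells, and the gap between them, can be pulled from a summary table like `ebike_full` instead of being read off the heatmap. The sketch below shows the idea on a small stand-in table. The first two rows use the percentages quoted above, while the third row and the object name `cells` are hypothetical.

```r
# Stand-in for ebike_full: two rows quote the report, the third is made up
cells <- data.frame(
  rider        = c("casual", "member", "member"),
  volume_group = c("Low Traffic", "High Traffic", "Medium Traffic"),
  time_period  = c("Evening Rush", "Morning Rush", "Midday"),
  ebike_pct    = c(78.8, 63.5, 68.0)
)

# Rows with the largest and smallest e-bike share
highest <- cells[which.max(cells$ebike_pct), ]
lowest  <- cells[which.min(cells$ebike_pct), ]

# Gap in percentage points
gap_pp <- highest$ebike_pct - lowest$ebike_pct

print(rbind(highest, lowest))
print(round(gap_pp, 1))
```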

Conclusion

Answer to Research Question:

Station infrastructure is inversely related to e-bike usage: high-infrastructure areas show lower e-bike shares, not higher ones.

|Infrastructure Level | E-bike Usage|
|:--------------------|------------:|
|Low                  |         ~72%|
|Medium               |         ~72%|
|High                 |         ~67%|
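
The report does not show the code behind this table. A minimal sketch of one way such a summary could be produced is below, in base R: join each trip to its start station's infrastructure level, then take the e-bike share per level. The toy inputs are hypothetical stand-ins for `citibike_sample` and `stations`, and computing the share over trips (rather than averaging station-level percentages) is an assumption.

```r
# Hypothetical stand-in for the trip table
trips <- data.frame(
  start_station_id = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  rideable_type    = c("electric_bike", "electric_bike", "classic_bike",
                       "electric_bike", "classic_bike",
                       "electric_bike", "classic_bike", "classic_bike",
                       "electric_bike")
)

# Hypothetical stand-in for the station classification
station_levels <- data.frame(
  start_station_id     = c("A", "B", "C"),
  infrastructure_level = factor(c("Low", "Medium", "High"),
                                levels = c("Low", "Medium", "High"))
)

# Attach each trip's infrastructure level, then compute the e-bike share
joined <- merge(trips, station_levels, by = "start_station_id")
joined$is_ebike <- joined$rideable_type == "electric_bike"

by_level <- aggregate(
  is_ebike ~ infrastructure_level,
  data = joined,
  FUN  = function(x) 100 * mean(x)
)
names(by_level)[2] <- "ebike_pct"

print(by_level)
```

A trip-weighted share lets busy stations dominate their group. Averaging station-level percentages would instead give every station equal weight, and the two can differ noticeably.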

Explanation:

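A plausible explanation, which this analysis does not test directly because the trip data records rides taken rather than bikes available: high-infrastructure areas attract more riders. E-bikes are preferred, so they get taken first, and by the time many riders arrive, only classic bikes remain.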

Connection to Overarching Question:

For bike type choice, infrastructure appears to affect availability indirectly through demand:

  • Infrastructure effect (indirect): High-infrastructure areas likely draw higher demand, which depletes the e-bike supply
  • External factors (direct): Casual riders prefer e-bikes more than members do; evening riders use e-bikes more than morning riders

Infrastructure does not directly determine bike type preference, but it appears to shape availability by concentrating demand.